Loading Files

1. Archive File

2. Downloading prediction files from Udacity servers programmatically

3. Downloading Twitter JSON data using Tweepy

Source: https://www.geeksforgeeks.org/python-api-get_status-in-tweepy/
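In the notebook, each tweet is fetched with Tweepy's `api.get_status(tweet_id)` and its `._json` payload is written one JSON object per line to a text file, which is then read back into a dataframe. A minimal sketch of the read-back step, with toy strings standing in for the file contents (the fields `id`, `retweet_count` and `favorite_count` are actual Twitter API fields; the helper name is illustrative):

```python
import json
import pandas as pd

def parse_tweet_lines(lines):
    """Parse an iterable of one-JSON-object-per-line tweet strings
    (as written from Tweepy's get_status(...)._json) into a DataFrame
    of the fields used later in the analysis."""
    records = []
    for line in lines:
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(records)

# Toy stand-ins for two lines of the downloaded tweet JSON file
sample_lines = [
    '{"id": 111, "retweet_count": 5, "favorite_count": 12}',
    '{"id": 222, "retweet_count": 3, "favorite_count": 7}',
]
df_json = parse_tweet_lines(sample_lines)
```

Reading line by line keeps the download resumable: tweets that fail (deleted or protected) can simply be skipped without losing the lines already written.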

Assessment

In this section, we assess the dataframes created in the gathering section, both visually and programmatically, to identify data quality and data tidiness issues

NOTE: There are unusual values like 'a' and 'None' in the 'name' column

Visual Assessment Notes

Quality:

1. Quite a number of dogs are predicted incorrectly

2. The predicted dog names in p1, p2 and p3 are not in a uniform format: the first letter is capitalized for some and not for others. They should be converted to lower case for uniformity

3. tweet_id is stored as an integer. It should be converted to string

4. Compared with the archive, 278 tweets are missing from the predictions file and 26 tweets are missing from the JSON file

Assessment Summary

Quality:

Archive:
  1. The tweet ID field should be an object data type, not an integer or float, because IDs are not numeric and are not intended for calculations.

  2. Dog names: in the 'name' column, there are several values that are not dog names, like 'a', 'the', 'such', etc. All of these erroneous values are lowercase, an important pattern that can be used to clean up this field. Another option is to drop duplicated values.

  3. There are many missing values in columns such as 'in_reply_to_status_id' and 'in_reply_to_user_id'. This indicates that not all tweets are replies; there are only 78 replies.

  4. The data types of 'timestamp' and 'retweeted_status_timestamp' are incorrect; they should be converted to datetime.

  5. There are 181 retweets, identified by non-null values in 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'.

  6. There are missing URLs in the 'expanded_urls' column.

  7. There are 23 rows whose 'rating_denominator' is not equal to 10.

Predictions:

  1. tweet_id is stored as an integer. It should be converted to string

JSON:

  1. There are 26 missing values

Tidiness:

Archive:

  1. The four columns doggo, floofer, pupper and puppo all describe the same variable, the dog's stage, so we can melt these columns into a single column named "dog_stage"

Prediction and Json:

  1. Identify the best predicted breed based on confidence and create two new columns, breed and confidence
  2. The df2 and df3 tables are part of the same observational unit as df1, but they are stored as three separate tables; they should be merged and stored in a file called twitter_archive_master.csv, as per project instructions.

Cleaning

In this section, I will clean the data quality and data tidiness issues identified during assessment, using the Define, Code and Test pattern

1. ID fields: ID fields such as tweet_id and in_reply_to_status_id should be objects

Define: Tweet ID field: this field should be an object, not an integer or float, because it is not numeric and is not intended for calculations.

Note: The other ID fields are left untouched because they are removed in subsequent steps anyway

Code:
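A minimal sketch of this conversion, using a toy frame in place of the real df_archive (the column name tweet_id comes from the archive file):

```python
import pandas as pd

# Toy stand-in for df_archive; the real frame is read from the archive CSV
df_archive = pd.DataFrame({"tweet_id": [892420643555336193, 892177421306343426]})

# IDs are labels, not quantities: cast to string so pandas treats them as objects
df_archive["tweet_id"] = df_archive["tweet_id"].astype(str)
```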

Test:

2. Dog names: in the 'name' column, there are several values that are not dog names, like 'a', 'the', 'such', etc. All of these erroneous values are lowercase, an important pattern that can be used to clean up this field. Another option is to drop duplicated values

Define:

Identify the names that are not dog names and replace them in df_archive

Code:
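A sketch of the lowercase-pattern cleanup described in the Define, with toy names in place of the real 'name' column (replacing the bad values with NaN is one reasonable choice; the notebook may rename them differently):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_archive's 'name' column
df_archive = pd.DataFrame({"name": ["Phineas", "a", "the", "None", "Tilly"]})

# Real dog names in the archive are capitalized; all-lowercase tokens
# ('a', 'the', 'such', ...) and the literal string 'None' are extraction
# errors, so mark them as missing
bad = df_archive["name"].str.islower() | (df_archive["name"] == "None")
df_archive.loc[bad, "name"] = np.nan
```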

Test

3. There are many missing values in columns such as 'in_reply_to_status_id' and 'in_reply_to_user_id'. This indicates that not all tweets are replies; there are only 78 replies

Define

Per the introduction, we are only interested in original tweets, not replies, so the reply rows and the reply-related columns can be dropped

Code
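A sketch of dropping the reply rows, with a toy frame standing in for df_archive (the real frame would also carry 'in_reply_to_user_id', handled the same way):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_archive
df_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "in_reply_to_status_id": [np.nan, 4.2e17, np.nan],
})

# A non-null in_reply_to_status_id marks a reply: keep only original tweets,
# then drop the now all-null column
df_archive = df_archive[df_archive["in_reply_to_status_id"].isna()]
df_archive = df_archive.drop(columns=["in_reply_to_status_id"])
```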

Test

4. The data types of 'timestamp' and 'retweeted_status_timestamp' are incorrect; they should be converted to datetime

Define

The data types of timestamp and retweeted_status_timestamp need to be corrected to the datetime format

Code
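A sketch of the conversion with `pd.to_datetime`, using a single toy timestamp in the archive's format:

```python
import pandas as pd

# Toy stand-in for df_archive's string timestamps
df_archive = pd.DataFrame({"timestamp": ["2017-08-01 16:23:56 +0000"]})

# Parse the strings into proper (timezone-aware) datetimes
df_archive["timestamp"] = pd.to_datetime(df_archive["timestamp"])
```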

Test

5. There are 181 retweets, identified by non-null values in 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'

Define

There are 181 retweets. As per the project introduction, we are only interested in original tweets, so the retweet rows need to be dropped

Code
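A sketch of dropping retweets, with a toy frame in place of df_archive:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_archive
df_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [np.nan, 8.9e17, np.nan],
    "retweeted_status_user_id": [np.nan, 4.2e9, np.nan],
    "retweeted_status_timestamp": [np.nan, "2017-01-01", np.nan],
})

# Rows with a non-null retweeted_status_id are retweets (181 in the real
# archive): drop them, then drop the retweet-only columns
df_archive = df_archive[df_archive["retweeted_status_id"].isna()]
df_archive = df_archive.drop(columns=[
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
])
```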

6. There are missing urls in 'expanded_urls' column

Define

There are 59 missing values in 'expanded_urls'. They need to be verified and dropped. However, 56 of those rows were already removed in earlier cleaning steps, so only 3 missing values remain

Code

Test

7. There are 23 rows whose 'rating_denominator' is not equal to 10

Define

Per the introduction to the archive file, the denominators should always equal 10. Here, we identify the rating_denominator values that are not equal to 10

Code
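A sketch of flagging the off-scale denominators, with toy values in place of the real column:

```python
import pandas as pd

# Toy stand-in for df_archive
df_archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_denominator": [10, 70, 10],
})

# Select the rows whose denominator deviates from the site's fixed /10 scale
odd_denominators = df_archive[df_archive["rating_denominator"] != 10]
```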

Test

8. Predictions file: tweet_id is stored as an integer. It should be converted to string

Define

tweet_id in the predictions file has to be converted to string, since it is an identifier and not a numeric value used in analysis

Code

Test

9. JSON quality: there are 26 missing values in df_json

Define

The missing values in df_json correspond to the erroneous rows already removed from df_archive, hence no action is needed

Tidiness:

df_archive file:

1. The four columns doggo, floofer, pupper and puppo all describe the same variable, the dog's stage, so we can melt these columns into a single column named "dog_stage"

Define:

There are 4 columns for dog stages, which should be melted into one dog_stage column for analysis; rows with multiple stages are identified and combined into that single column

Code:
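A sketch of collapsing the four stage columns into one, with toy rows in place of df_archive (joining multi-stage rows with a comma is one choice; the notebook may format them differently):

```python
import numpy as np
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Toy stand-in for df_archive's four stage columns, which hold either the
# stage name or the literal string 'None'
df_archive = pd.DataFrame({
    "doggo":   ["None", "doggo", "doggo"],
    "floofer": ["None", "None",  "None"],
    "pupper":  ["pupper", "None", "pupper"],
    "puppo":   ["None", "None",  "None"],
})

def combine(row):
    # Collect whichever stages are set; multi-stage rows become 'doggo,pupper'
    found = [s for s in stages if row[s] == s]
    return ",".join(found) if found else np.nan

df_archive["dog_stage"] = df_archive.apply(combine, axis=1)
df_archive = df_archive.drop(columns=stages)
```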

Test

Tidiness | Predictions

2. Identify the best predicted breed type based on confidence and create two new columns with breed and confidence

Define:

There are three breed predictions with different confidences. We need to find the best of the three: take the highest-confidence predicted breed and its confidence, and create two new columns with these values

Code:
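A sketch of picking the best prediction, with a toy frame in place of the predictions file. Since the file's confidences are already in descending order (p1_conf ≥ p2_conf ≥ p3_conf), walking p1 → p3 and keeping the first prediction flagged as a real dog gives the highest-confidence breed:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the predictions file
df_predictions = pd.DataFrame({
    "p1": ["golden_retriever", "paper_towel"], "p1_conf": [0.95, 0.60], "p1_dog": [True, False],
    "p2": ["labrador_retriever", "pug"],       "p2_conf": [0.03, 0.30], "p2_dog": [True, True],
    "p3": ["kuvasz", "chair"],                 "p3_conf": [0.01, 0.05], "p3_dog": [True, False],
})

def best_prediction(row):
    # Confidences descend from p1 to p3, so the first true-dog hit wins
    for i in (1, 2, 3):
        if row[f"p{i}_dog"]:
            return pd.Series({"breed": row[f"p{i}"], "confidence": row[f"p{i}_conf"]})
    # No prediction was a dog breed at all
    return pd.Series({"breed": np.nan, "confidence": np.nan})

df_predictions[["breed", "confidence"]] = df_predictions.apply(best_prediction, axis=1)
```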

Test:

Tidiness | Archive, Prediction and Json files

  1. The df_Prediction_clean and df_json_clean tables are part of the same observational unit as df_archive_clean, but they are stored as three separate tables; they should be merged and stored in a file called twitter_archive_master.csv, as per project instructions.
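A sketch of the three-way merge on tweet_id, with toy frames standing in for the cleaned tables:

```python
import pandas as pd

# Toy stand-ins for the three cleaned tables
df_archive_clean = pd.DataFrame({"tweet_id": ["1", "2"], "rating_numerator": [13, 12]})
df_prediction_clean = pd.DataFrame({"tweet_id": ["1", "2"], "breed": ["pug", "chow"]})
df_json_clean = pd.DataFrame({"tweet_id": ["1", "2"], "favorite_count": [100, 50]})

# Inner-join on tweet_id so the master table only keeps tweets present
# in all three sources
df_master = (df_archive_clean
             .merge(df_prediction_clean, on="tweet_id", how="inner")
             .merge(df_json_clean, on="tweet_id", how="inner"))

# In the notebook this is then written out with:
# df_master.to_csv("twitter_archive_master.csv", index=False)
```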

Additional cleaning

After merging,

Define

Fill all NaN values with 0 using the fillna() method, and use to_numeric to convert floats to integers

Code:

Test:

Define:

The dog_stage column is in string format; it needs to be converted to a category, since there is a finite number of dog stages

Code:

Test:

Additional cleaning suggested by mentor about ratings:

Define: Some of the rating values were not properly extracted from the tweet text.

Code:
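A sketch of re-extracting the ratings with a regex that allows decimal numerators, assuming (as the mentor's note suggests) that values like 13.5/10 were the ones mis-extracted; the toy texts stand in for the real tweet text column:

```python
import pandas as pd

# Toy stand-in for the tweet text column
df = pd.DataFrame({"text": [
    "This is Bella. She's 13.5/10",
    "Meet Sam. 12/10 would pet",
]})

# Allow an optional decimal part in the numerator so 13.5/10 is not
# truncated to 5/10 by an integer-only pattern
rating = df["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
df["rating_numerator"] = rating[0].astype(float)
df["rating_denominator"] = rating[1].astype(int)
```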

Test:

Storing data into 'twitter_archive_master.csv'

Data Analysis and Visualization

In this data analysis and visualization process, I am going to use various columns such as ratings, dog_stage, breed, and the retweet and favorite counts

Most of the ratings fall between 9 and 12

From the pairplot above, it is clear that favorite_count and retweet_count appear to be positively correlated
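The relationship seen in the pairplot can also be quantified; a sketch with toy counts standing in for the merged data:

```python
import pandas as pd

# Toy stand-in for the merged master table's two count columns
df_master = pd.DataFrame({
    "retweet_count":  [100, 250, 400, 800, 1600],
    "favorite_count": [300, 700, 1200, 2500, 5000],
})

# Pearson correlation backs up the visual impression of a strong
# positive relationship between the two counts
corr = df_master["retweet_count"].corr(df_master["favorite_count"])

# The same relationship can be drawn directly with:
# df_master.plot.scatter(x="retweet_count", y="favorite_count")
```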